This R Markdown script (together with the corresponding TAB-separated input data file InfoRateData.csv and the resulting HTML document) contains the full analysis and plotting code accompanying the paper *Human languages share an optimal information transmission rate*.
For more information on the data, please see Oh (2015). There are 17 languages in total (see the Table below).
The oral corpus is based on a subset of the Multext (Multilingual Text Tools and Corpora) parallel corpus (Campione & Véronis, 1998) in British English, German, and Italian. The material consists of 15 short texts of 3-5 semantically connected sentences, carefully translated by a native speaker of each language.
For the other 14 languages, two of the authors supervised the translation and recording of new datasets. All participants were native speakers of the target language, with a focus on a specific variety of the language when possible (e.g. Mandarin spoken in Beijing, Serbian in Belgrade, and Korean in Seoul). No strict control of age or of the speakers’ social diversity was imposed, but speakers were mainly students or members of academic institutions. Speakers were asked to read each text three times (first silently, then aloud twice). The texts were presented one by one on the screen, in random order, in a self-paced reading paradigm. This way, speakers familiarized themselves with the texts and reduced their reading errors. The second loud recording was analyzed in this study.
Text datasets were acquired from various sources as illustrated in the Table below. After an initial data curation, each dataset was phonetically transcribed and automatically syllabified by a rule-based program written by one of the authors, except in the following cases:
Additionally, no syllabification was required for Sino-Tibetan languages (Cantonese and Mandarin Chinese) since one ideogram corresponds to one syllable.
| Language | Family | ISO 639-3 | Corpus |
|---|---|---|---|
| Basque | Basque | EUS | E-Hitz (Perea et al., 2006) |
| British English | Indo-European | ENG | WebCelex (MPI for Psycholinguistics) |
| Cantonese | Sino-Tibetan | YUE | A linguistic corpus of mid-20th century Hong Kong Cantonese |
| Catalan | Indo-European | CAT | Frequency dictionary (Zséder et al., 2012) |
| Finnish | Uralic | FIN | Finnish Parole Corpus |
| French | Indo-European | FRA | Lexique 3.80 (New et al., 2001) |
| German | Indo-European | DEU | WebCelex (MPI for Psycholinguistics) |
| Hungarian | Uralic | HUN | Hungarian National Corpus (Váradi, 2002) |
| Italian | Indo-European | ITA | The Corpus PAISÀ (Lyding et al., 2014) |
| Japanese | Japanese | JPN | Japanese Internet Corpus (Sharoff, 2006) |
| Korean | Korean | KOR | Leipzig Corpora Collection (LCC) |
| Mandarin Chinese | Sino-Tibetan | CMN | Chinese Internet Corpus (Sharoff, 2006) |
| Serbian | Indo-European | SRP | Frequency dictionary (Zséder et al., 2012) |
| Spanish | Indo-European | SPA | Frequency dictionary (Zséder et al., 2012) |
| Thai | Tai-Kadai | THA | Thai National Corpus (TNC) |
| Turkish | Turkic | TUR | Leipzig Corpora Collection (LCC) |
| Vietnamese | Austroasiatic | VIE | VNSpeechCorpus (Le et al., 2004) |
The data is structured as follows:
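As a minimal sketch (assuming InfoRateData.csv sits next to this script, and assuming the default TAB-separated layout), the input file can be read with base R; the resulting object corresponds to the `info.rate.data` data frame used in the analyses below.

```r
# Minimal sketch: read the TAB-separated input file with base R.
# The path is an assumption; adjust if the file lives elsewhere.
load_info_rate <- function(path = "InfoRateData.csv") {
  read.delim(path, stringsAsFactors = FALSE)
}
# info.rate.data <- load_info_rate()
```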
NS: exploratory plots.
mean=101.217, median=100, sd=24.694, CV=0.244, min=49, max=162, kurtosis=2.485, skewness=0.178.
SR: exploratory plots.
SR per speaker.
SR by language.
mean=6.612, median=6.76, sd=1.133, CV=0.171, min=3.59, max=9.38, kurtosis=2.382, skewness=-0.197.
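The summary lines above (and the analogous ones for ShE, ID, ShIR and IR below) can be reproduced with base R. Skewness and kurtosis here use the plain moment-based definitions (m3/m2^1.5 and m4/m2^2, as in the moments package); that this is the exact estimator used is an assumption.

```r
# Moment-based descriptive summary (base R; skewness = m3/m2^1.5,
# kurtosis = m4/m2^2, matching the moments-package definitions).
describe_var <- function(x) {
  m  <- mean(x); n <- length(x)
  m2 <- mean((x - m)^2); m3 <- mean((x - m)^3); m4 <- mean((x - m)^4)
  c(mean = m, median = median(x), sd = sd(x), CV = sd(x) / m,
    min = min(x), max = max(x),
    kurtosis = m4 / m2^2, skewness = m3 / m2^1.5)
}
```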
ShE and ID: exploratory plots.
ShE vs ID.
Pearson's product-moment correlation
data: tmp1$ShE and tmp1$ID
t = 2.0326, df = 15, p-value = 0.06019
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02052914 0.77274779
sample estimates:
cor
0.4647009
Spearman's rank correlation rho
data: tmp1$ShE and tmp1$ID
S = 451.88, p-value = 0.07259
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4462208
Paired t-test
data: tmp1$ShE and tmp1$ID
t = 11.635, df = 16, p-value = 3.213e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.158040 3.119607
sample estimates:
mean of the differences
2.638824
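The three tests above are standard base-R calls; a sketch on hypothetical per-language vectors standing in for `tmp1$ShE` and `tmp1$ID`:

```r
# Hypothetical per-language values standing in for tmp1$ShE and tmp1$ID.
she <- c(8.1, 8.7, 9.0, 7.9, 8.4, 9.3, 8.8)
id  <- c(5.5, 6.0, 6.4, 5.2, 5.9, 6.6, 6.1)

cor.test(she, id, method = "pearson")   # Pearson's r
cor.test(she, id, method = "spearman")  # Spearman's rho
t.test(she, id, paired = TRUE)          # paired t-test on the means
```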
ShE:
mean=8.619, median=8.69, sd=0.906, CV=0.105, min=6.07, max=9.83, kurtosis=4.662, skewness=-1.124.
ID:
mean=6.009, median=5.56, sd=0.883, CV=0.147, min=4.83, max=8.02, kurtosis=2.521, skewness=0.741.
ShIR: exploratory plots.
ShIR per speaker.
ShIR by language.
IR: exploratory plots.
IR per speaker.
IR by language.
ShIR:
mean=56.525, median=56.92, sd=9.217, CV=0.163, min=32.77, max=83.04, kurtosis=2.335, skewness=0.048.
IR:
mean=39.033, median=39.05, sd=4.945, CV=0.127, min=25.64, max=56.3, kurtosis=3.422, skewness=0.241.
SR vs ID
Pearson's product-moment correlation
data: info.rate.data$SR and info.rate.data$ID
t = -46.635, df = 2263, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7204624 -0.6784227
sample estimates:
cor
-0.7000486
Spearman's rank correlation rho
data: info.rate.data$SR and info.rate.data$ID
S = 3315700000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.7120735
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.30 |
| Language | 0.54 |
| Speaker | 0.00 |
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.01 |
| Language | 0.62 |
| Speaker | 0.29 |
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.01 |
| Language | 0.56 |
| Speaker | 0.34 |
| Level-1 factor (f) | ICC |
|---|---|
| Text | 0.02 |
| Language | 0.36 |
| Speaker | 0.49 |
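The ICCs in the tables above are the proportions of total variance attributable to each grouping factor in an intercept-only mixed model. A minimal sketch of that computation from variance components (the numbers below are hypothetical stand-ins for what, e.g., `lme4::VarCorr()` would return):

```r
# ICC as the share of total variance for each level-1 factor.
icc_table <- function(varcomps, resid_var) {
  total <- sum(varcomps) + resid_var
  round(varcomps / total, 2)
}
# Hypothetical variance components (Text, Language, Speaker) + residual:
icc_table(c(Text = 0.05, Language = 0.65, Speaker = 0.35), resid_var = 0.15)
```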
| model | AIC | BIC |
|---|---|---|
| 1 + (1 | Text) + (1 | Language) + (1 | Speaker) | 1998.03 | 2026.65 |
| 1 + (1 | Language) + (1 | Speaker) | 2269.3 | 2292.2 |
| 1 + (1 | Text) + (1 | Speaker) | 2131.96 | 2154.86 |
| 1 + (1 | Text) + (1 | Language) | 4490.99 | 4513.89 |
| 1 + Sex + (1 | Text) + (1 | Language) + (1 | Speaker) | 1990.57 | 2024.92 |
| 1 + Sex + (1 | Language) + (1 | Speaker) | 2261.69 | 2290.31 |
| 1 + Sex + (1 | Text) + (1 | Speaker) | 2131.72 | 2160.35 |
| 1 + Sex + (1 | Text) + (1 | Language) | 4364.38 | 4393.01 |
We consider here the full model SR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).
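The tables above compare lmer random-effects structures by AIC and BIC. The same logic — fit the candidate models on identical data, then tabulate `AIC()` and `BIC()` — can be sketched with base R; the sketch below uses plain `lm()` on simulated data (an illustration of the comparison procedure only, since the real models require lme4):

```r
set.seed(1)
# Simulated data with a genuine Sex effect, standing in for SR.
d <- data.frame(Sex = rep(c("F", "M"), each = 100))
d$SR <- 6.5 + 0.3 * (d$Sex == "M") + rnorm(200, sd = 0.3)

models <- list(
  "1"       = lm(SR ~ 1, data = d),
  "1 + Sex" = lm(SR ~ 1 + Sex, data = d)
)
aic_tab <- data.frame(model = names(models),
                      AIC = sapply(models, AIC),
                      BIC = sapply(models, BIC))
aic_tab  # the model with Sex should win on both criteria here
```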
| model | AIC | BIC |
|---|---|---|
| 1 + (1 | Text) + (1 | Language) + (1 | Speaker) | 10041.03 | 10069.65 |
| 1 + (1 | Language) + (1 | Speaker) | 10307.63 | 10330.53 |
| 1 + (1 | Text) + (1 | Speaker) | 10093.09 | 10115.99 |
| 1 + (1 | Text) + (1 | Language) | 12574.43 | 12597.33 |
| 1 + Sex + (1 | Text) + (1 | Language) + (1 | Speaker) | 10029.53 | 10063.88 |
| 1 + Sex + (1 | Language) + (1 | Speaker) | 10295.98 | 10324.6 |
| 1 + Sex + (1 | Text) + (1 | Speaker) | 10086.35 | 10114.97 |
| 1 + Sex + (1 | Text) + (1 | Language) | 12440.04 | 12468.67 |
We consider here the full model IR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).
We will use a Gaussian distribution (with fixed or modelled variance).
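The "fixed or modelled variance" distinction is what GAMLSS's `sigma.formula` provides: besides the mean, the (log) standard deviation also gets its own linear predictor. A base-R sketch of the idea, fitting a Gaussian whose sigma differs by group via maximum likelihood (an illustration of the principle, not the gamlss implementation; all parameter values are made up):

```r
set.seed(42)
# Two groups with different means AND different (log) sds.
g <- rep(0:1, each = 500)
y <- rnorm(1000, mean = 6.45 + 0.32 * g, sd = exp(-1.31 + 0.11 * g))

# Negative log-likelihood with linear predictors for mu and log(sigma).
nll <- function(p) {
  mu   <- p[1] + p[2] * g
  lsig <- p[3] + p[4] * g
  -sum(dnorm(y, mean = mu, sd = exp(lsig), log = TRUE))
}
fit <- optim(c(6, 0, -1, 0), nll)
round(fit$par, 2)  # (mu0, mu_g, log-sigma0, log-sigma_g)
```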
******************************************************************
Summary of the Quantile Residuals
mean = 6.666774e-05
variance = 1.000442
coef. of skewness = -0.0552563
coef. of kurtosis = 3.323173
Filliben correlation coefficient = 0.9990793
******************************************************************
Deviance= 1048.516
AIC= 1447.603
******************************************************************
Summary of the Quantile Residuals
mean = 0.001738762
variance = 1.000438
coef. of skewness = -0.02112376
coef. of kurtosis = 2.757837
Filliben correlation coefficient = 0.9991447
******************************************************************
Deviance= 719.0835
AIC= 1299.257
The distribution of the residuals is less heteroscedastic than before, and the fit to the data is better. The full summary of the model is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + Sex + random(Text) + random(Language) + random(Speaker), sigma.formula = ~1 + Sex + random(Text) + random(Language) + random(Speaker), family = NO(mu.link = "identity"),
data = d, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.453290 0.007618 847.14 <2e-16 ***
SexM 0.320159 0.011487 27.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.30948 0.02098 -62.417 < 2e-16 ***
SexM 0.11194 0.02972 3.767 0.00017 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 2265
Degrees of Freedom for the fit: 290.0869
Residual Deg. of Freedom: 1974.913
at cycle: 48
Global Deviance: 719.0835
AIC: 1299.257
SBC: 2960.101
******************************************************************
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.3977
Random effect parameter sigma_b: 0.108314
Smoothing parameter lambda : 85.7824
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 16.9836
Random effect parameter sigma_b: 0.87066
Smoothing parameter lambda : 1.32915
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 165.2778
Random effect parameter sigma_b: 0.555308
Smoothing parameter lambda : 3.49815
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 0.583681
Random effect parameter sigma_b: 0.0108169
Smoothing parameter lambda : 7523.57
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.91065
Random effect parameter sigma_b: 0.161228
Smoothing parameter lambda : 34.0803
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 73.93344
Random effect parameter sigma_b: 0.165342
Smoothing parameter lambda : 33.2784
******************************************************************
Summary of the Quantile Residuals
mean = 0.0002550467
variance = 1.000442
coef. of skewness = 0.01302313
coef. of kurtosis = 3.360878
Filliben correlation coefficient = 0.9990567
******************************************************************
Deviance= 9108.279
AIC= 9507.442
******************************************************************
Summary of the Quantile Residuals
mean = 0.001906696
variance = 1.000449
coef. of skewness = -0.02187398
coef. of kurtosis = 2.755234
Filliben correlation coefficient = 0.9991259
******************************************************************
Deviance= 8782.878
AIC= 9364.641
Again, this is a better fit to the data. The full summary of the model is:
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = IR ~ 1 + Sex + random(Text) + random(Language) + random(Speaker), sigma.formula = ~1 + Sex + random(Text) + random(Language) + random(Speaker), family = NO(mu.link = "identity"),
data = d, control = gamlss.control(n.cyc = 800, trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.08916 0.04521 842.56 <2e-16 ***
SexM 1.89529 0.06851 27.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47207 0.02098 22.501 < 2e-16 ***
SexM 0.11385 0.02972 3.831 0.000131 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 2265
Degrees of Freedom for the fit: 290.8814
Residual Deg. of Freedom: 1974.119
at cycle: 7
Global Deviance: 8782.878
AIC: 9364.641
SBC: 11030.03
******************************************************************
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.42046
Random effect parameter sigma_b: 0.658373
Smoothing parameter lambda : 2.32185
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 16.9518
Random effect parameter sigma_b: 3.06988
Smoothing parameter lambda : 0.106913
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 165.1517
Random effect parameter sigma_b: 3.31521
Smoothing parameter lambda : 0.0981439
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 0.09852989
Random effect parameter sigma_b: 0.00428681
Smoothing parameter lambda : 47821.8
Language
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.96383
Random effect parameter sigma_b: 0.163535
Smoothing parameter lambda : 33.0774
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 75.29513
Random effect parameter sigma_b: 0.168066
Smoothing parameter lambda : 32.181
Let’s model SR with ID as an additional predictor (fixed effect). N.B. In this case, we must drop Language as a random effect, since each language has, by definition, only one value of ID.
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = SR ~ 1 + ID + Sex + random(Text) + random(Speaker), sigma.formula = ~1 + ID + Sex + random(Text) + random(Speaker), family = NO(mu.link = "identity"), data = d, control = gamlss.control(n.cyc = 800,
trace = FALSE), i.control = glim.control(bf.cyc = 800))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.794174 0.036777 320.69 <2e-16 ***
ID -0.888238 0.005842 -152.06 <2e-16 ***
SexM 0.310995 0.011382 27.32 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.79946 0.10475 -7.632 3.57e-14 ***
ID -0.08642 0.01705 -5.070 4.35e-07 ***
SexM 0.10951 0.02972 3.685 0.000235 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas:
i) Std. Error for smoothers are for the linear effect only.
ii) Std. Error for the linear terms maybe are not accurate.
------------------------------------------------------------------
No. of observations in the fit: 2265
Degrees of Freedom for the fit: 293.322
Residual Deg. of Freedom: 1971.678
at cycle: 18
Global Deviance: 686.5481
AIC: 1273.192
SBC: 2952.558
******************************************************************
******************************************************************
Summary of the Quantile Residuals
mean = 0.001881969
variance = 1.000438
coef. of skewness = -0.02304788
coef. of kurtosis = 2.677419
Filliben correlation coefficient = 0.9989511
******************************************************************
Deviance= 686.5481
AIC= 1273.192
Adding ID as a predictor improves the fit (as judged by AIC). The estimate for ID is negative, but significance is difficult to assess for GAMLSS models involving smoothing functions. However, a simple lmer model also shows a significant effect of ID:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
ID 16.4136 16.4136 1 166.55 162.5728 < 2.2e-16 ***
Sex 0.8047 0.8047 1 166.63 7.9707 0.005334 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 14.42354
Random effect parameter sigma_b: 0.10986
Smoothing parameter lambda : 83.3869
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 167.487
Random effect parameter sigma_b: 0.74601
Smoothing parameter lambda : 1.94032
Text
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 1.916828
Random effect parameter sigma_b: 0.0201916
Smoothing parameter lambda : 2061.61
Speaker
Random effects fit using the gamlss function random()
Degrees of Freedom for the fit : 103.4947
Random effect parameter sigma_b: 0.234219
Smoothing parameter lambda : 16.0415
In what follows, mixing probabilities are independent of factors such as Sex.
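gamlssMX fits these mixtures by EM. A minimal base-R EM for a two-component Gaussian mixture makes the mechanics concrete (a sketch of the algorithm on simulated data, not the gamlssMX implementation):

```r
set.seed(7)
# Simulated well-separated two-component mixture, standing in for SR.
y <- c(rnorm(300, 5.3, 0.6), rnorm(700, 7.2, 0.8))

# EM for a K = 2 Gaussian mixture.
pi1 <- 0.5; mu <- c(5, 8); sg <- c(1, 1)
for (it in 1:200) {
  # E-step: responsibilities of component 1.
  d1 <- pi1 * dnorm(y, mu[1], sg[1])
  d2 <- (1 - pi1) * dnorm(y, mu[2], sg[2])
  r  <- d1 / (d1 + d2)
  # M-step: weighted updates of mixing probability, means, sds.
  pi1   <- mean(r)
  mu[1] <- sum(r * y) / sum(r)
  mu[2] <- sum((1 - r) * y) / sum(1 - r)
  sg[1] <- sqrt(sum(r * (y - mu[1])^2) / sum(r))
  sg[2] <- sqrt(sum((1 - r) * (y - mu[2])^2) / sum(1 - r))
}
c(pi1 = pi1, mu1 = mu[1], mu2 = mu[2])
```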
Between 1 and 5 Gaussian distributions:
1 component
Mixing Family: "NO"
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 1, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
6.612
Sigma Coefficients for model: 1
(Intercept)
0.1248
Estimated probabilities: 1
Degrees of Freedom for the fit: 2 Residual Deg. of Freedom 2263
Global Deviance: 6993.16
AIC: 6997.16
SBC: 7008.61
2 components
Mixing Family: c("NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 2, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
5.293
Sigma Coefficients for model: 1
(Intercept)
-0.4553
Mu Coefficients for model: 2
(Intercept)
7.157
Sigma Coefficients for model: 2
(Intercept)
-0.2298
Estimated probabilities: 0.2925825 0.7074175
Degrees of Freedom for the fit: 5 Residual Deg. of Freedom 2260
Global Deviance: 6861.89
AIC: 6871.89
SBC: 6900.52
3 components
Mixing Family: c("NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 3, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
5.23
Sigma Coefficients for model: 1
(Intercept)
-0.4888
Mu Coefficients for model: 2
(Intercept)
6.612
Sigma Coefficients for model: 2
(Intercept)
-0.1508
Mu Coefficients for model: 3
(Intercept)
7.271
Sigma Coefficients for model: 3
(Intercept)
-0.263
Estimated probabilities: 0.2537915 0.2147315 0.531477
Degrees of Freedom for the fit: 8 Residual Deg. of Freedom 2257
Global Deviance: 6862.04
AIC: 6878.04
SBC: 6923.84
4 components
Mixing Family: c("NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 4, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
7.171
Sigma Coefficients for model: 1
(Intercept)
-0.2973
Mu Coefficients for model: 2
(Intercept)
5.092
Sigma Coefficients for model: 2
(Intercept)
-0.5485
Mu Coefficients for model: 3
(Intercept)
5.806
Sigma Coefficients for model: 3
(Intercept)
-0.454
Mu Coefficients for model: 4
(Intercept)
7.316
Sigma Coefficients for model: 4
(Intercept)
-0.2168
Estimated probabilities: 0.4443139 0.1849648 0.1507718 0.2199495
Degrees of Freedom for the fit: 11 Residual Deg. of Freedom 2254
Global Deviance: 6861.1
AIC: 6883.1
SBC: 6946.08
5 components
Mixing Family: c("NO", "NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = SR ~ 1, family = NO, K = 5, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
6.375
Sigma Coefficients for model: 1
(Intercept)
-0.3485
Mu Coefficients for model: 2
(Intercept)
7.228
Sigma Coefficients for model: 2
(Intercept)
-1.001
Mu Coefficients for model: 3
(Intercept)
7.969
Sigma Coefficients for model: 3
(Intercept)
-0.5569
Mu Coefficients for model: 4
(Intercept)
6.248
Sigma Coefficients for model: 4
(Intercept)
-0.4579
Mu Coefficients for model: 5
(Intercept)
5.041
Sigma Coefficients for model: 5
(Intercept)
-0.6049
Estimated probabilities: 0.1788327 0.2320332 0.1993053 0.2001806 0.1896483
Degrees of Freedom for the fit: 14 Residual Deg. of Freedom 2251
Global Deviance: 6844.22
AIC: 6872.22
SBC: 6952.37
Comparing AIC
df AIC
mix.SR.NO.2 5 6871.892
mix.SR.NO.5 14 6872.218
mix.SR.NO.3 8 6878.040
mix.SR.NO.4 11 6883.103
mix.SR.NO.1 2 6997.163
Showing the distributions
Mixture of Gaussians for SR.
Between 1 and 5 Gaussian distributions:
1 component
Mixing Family: "NO"
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 1, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
39.03
Sigma Coefficients for model: 1
(Intercept)
1.598
Estimated probabilities: 1
Degrees of Freedom for the fit: 2 Residual Deg. of Freedom 2263
Global Deviance: 13667
AIC: 13671
SBC: 13682.5
2 components
Mixing Family: c("NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 2, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
39.76
Sigma Coefficients for model: 1
(Intercept)
1.766
Mu Coefficients for model: 2
(Intercept)
38.36
Sigma Coefficients for model: 2
(Intercept)
1.336
Estimated probabilities: 0.4795038 0.5204962
Degrees of Freedom for the fit: 5 Residual Deg. of Freedom 2260
Global Deviance: 13637.8
AIC: 13647.8
SBC: 13676.4
3 components
Mixing Family: c("NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 3, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
35.42
Sigma Coefficients for model: 1
(Intercept)
1.339
Mu Coefficients for model: 2
(Intercept)
39.78
Sigma Coefficients for model: 2
(Intercept)
0.89
Mu Coefficients for model: 3
(Intercept)
42.22
Sigma Coefficients for model: 3
(Intercept)
1.636
Estimated probabilities: 0.3631272 0.2946715 0.3422013
Degrees of Freedom for the fit: 8 Residual Deg. of Freedom 2257
Global Deviance: 13618.4
AIC: 13634.4
SBC: 13680.2
4 components
Mixing Family: c("NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 4, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
42.31
Sigma Coefficients for model: 1
(Intercept)
1.699
Mu Coefficients for model: 2
(Intercept)
39.43
Sigma Coefficients for model: 2
(Intercept)
0.6302
Mu Coefficients for model: 3
(Intercept)
40.07
Sigma Coefficients for model: 3
(Intercept)
1.316
Mu Coefficients for model: 4
(Intercept)
34.48
Sigma Coefficients for model: 4
(Intercept)
1.25
Estimated probabilities: 0.2568272 0.1726828 0.3017301 0.2687599
Degrees of Freedom for the fit: 11 Residual Deg. of Freedom 2254
Global Deviance: 13613.6
AIC: 13635.6
SBC: 13698.6
5 components
Mixing Family: c("NO", "NO", "NO", "NO", "NO")
Fitting method: EM algorithm
Call: gamlssMX(formula = IR ~ 1, family = NO, K = 5, data = d, plot = FALSE)
Mu Coefficients for model: 1
(Intercept)
42.85
Sigma Coefficients for model: 1
(Intercept)
1.694
Mu Coefficients for model: 2
(Intercept)
37.38
Sigma Coefficients for model: 2
(Intercept)
1.457
Mu Coefficients for model: 3
(Intercept)
39.58
Sigma Coefficients for model: 3
(Intercept)
0.4193
Mu Coefficients for model: 4
(Intercept)
40.57
Sigma Coefficients for model: 4
(Intercept)
1.308
Mu Coefficients for model: 5
(Intercept)
35.24
Sigma Coefficients for model: 5
(Intercept)
1.337
Estimated probabilities: 0.2128878 0.2184185 0.1195341 0.2222382 0.2269214
Degrees of Freedom for the fit: 14 Residual Deg. of Freedom 2251
Global Deviance: 13612.4
AIC: 13640.4
SBC: 13720.5
Comparing AIC
df AIC
mix.IR.NO.3 8 13634.37
mix.IR.NO.4 11 13635.64
mix.IR.NO.5 14 13640.36
mix.IR.NO.2 5 13647.78
mix.IR.NO.1 2 13671.04
Showing the distributions
Mixture of Gaussians for IR.
We used three ways to estimate how unimodal a distribution is, as they tend to disagree and the problem of unimodality testing is far from settled (see Freeman & Dale, 2013): Silverman’s test (Silverman, 1981; Hall & York, 2001), Hartigan’s dip test (Hartigan & Hartigan, 1985; Hartigan, 1985), and the bimodality coefficient (BC; Freeman & Dale, 2013). For each such test, we performed four randomisation procedures to obtain an estimate of the “specialness” of the observed unimodality estimate; for each new permuted dataset, we recompute everything before estimating the unimodality of the permuted distribution:
The observed estimate (vertical blue solid line), the permuted distribution (gray histogram), and the “unimodality region” (shaded green rectangle) are shown below (for PM3, we also show the original estimate using the Speaker average SR as a vertical solid red line).
Permutation of the texts’ SRs (PM1).
Permutation of the languages’ ID (PM2).
Permutation of the speakers’ average SRs (PM3).
Permutation of the languages’ average SRs with speaker adjustment (PM4).
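Of the three unimodality measures, the bimodality coefficient (BC) is simple enough to sketch in base R. The version below uses bias-corrected skewness and excess kurtosis (the SAS convention, as in Freeman & Dale, 2013); values above 5/9 suggest bi-/multimodality. The data are simulated for illustration.

```r
# Bimodality coefficient (BC) with bias-corrected skewness and
# excess kurtosis (SAS convention, as in Freeman & Dale, 2013).
bimodality_coef <- function(x) {
  n  <- length(x); m <- mean(x)
  m2 <- mean((x - m)^2); m3 <- mean((x - m)^3); m4 <- mean((x - m)^4)
  g1 <- (m3 / m2^1.5) * sqrt(n * (n - 1)) / (n - 2)
  g2 <- ((n + 1) * (m4 / m2^2 - 3) + 6) * (n - 1) / ((n - 2) * (n - 3))
  (g1^2 + 1) / (g2 + 3 * (n - 1)^2 / ((n - 2) * (n - 3)))
}

set.seed(123)
bimodality_coef(rnorm(500))                        # unimodal: below 5/9
bimodality_coef(c(rnorm(250, -3), rnorm(250, 3)))  # bimodal: above 5/9
```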
| Scenario | Measure | Test | Observed estimate (p-value) | % more unimodal permutations |
|---|---|---|---|---|
| PM1 | SR | Silverman | - (0.019) | 74.2% |
| PM1 | SR | Dip | 0.006 (0.924) * | 100% |
| PM1 | SR | BC | 0.193 () * | 100% |
| PM1 | IR | Silverman | - (0.17) * | 86.8% |
| PM1 | IR | Dip | 0.005 (0.994) * | 50% |
| PM1 | IR | BC | 0.165 () * | 100% |
| PM2 | SR | Silverman | - (0.019) | 75.2% |
| PM2 | SR | Dip | 0.006 (0.924) * | 100% |
| PM2 | SR | BC | 0.193 () * | 100% |
| PM2 | IR | Silverman | - (0.17) * | 22.6% |
| PM2 | IR | Dip | 0.005 (0.994) * | 14.8% |
| PM2 | IR | BC | 0.165 () * | 97.7% |
| PM3 | SR | Silverman | - (0.019) | 100% |
| PM3 | SR | Dip | 0.006 (0.924) * | 0% |
| PM3 | SR | BC | 0.193 () * | 100% |
| PM3 | IR | Silverman | - (0.17) * | 86.2% |
| PM3 | IR | Dip | 0.005 (0.994) * | 6% |
| PM3 | IR | BC | 0.165 () * | 100% |
| PM4 | SR | Silverman | - (0.019) | 5.2% |
| PM4 | SR | Dip | 0.006 (0.924) * | 3.1% |
| PM4 | SR | BC | 0.193 () * | 75.5% |
| PM4 | IR | Silverman | - (0.17) * | 11.1% |
| PM4 | IR | Dip | 0.005 (0.994) * | 5.9% |
| PM4 | IR | BC | 0.165 () * | 97.3% |
We compute various distances between languages (as implemented by the function distance() in the package philentropy) with respect to the distributions of NS, SR and ID.
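The actual computations use philentropy::distance() on binned, probability-normalised distributions. As a base-R sketch, two of the distances can be written directly; note the Hellinger scaling used here (maximum sqrt(2)) is an assumption chosen to match the magnitudes in the table, and the Jensen-Shannon divergence uses natural logs.

```r
# Base-R sketches of two of the distances, applied to discrete
# (binned) probability distributions p and q that each sum to 1.
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2))  # max sqrt(2)

jensen_shannon <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)  # max log(2)
}

# e.g. SR histograms of two hypothetical languages on shared bins:
p <- c(0.1, 0.4, 0.4, 0.1)
q <- c(0.3, 0.3, 0.2, 0.2)
c(Hellinger = hellinger(p, q), JSD = jensen_shannon(p, q))
```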
Comparing the distribution of pairwise distances between languages.
| m1 | m2 | d | mean1 | median1 | sd1 | mean2 | median2 | sd2 | p |
|---|---|---|---|---|---|---|---|---|---|
| IR | NS | Hellinger | 0.89 | 0.84 | 0.32 | 1.20 | 1.19 | 0.39 | 0.00 |
| IR | NS | Jensen-Shannon | 0.18 | 0.14 | 0.12 | 0.29 | 0.26 | 0.16 | 0.00 |
| IR | NS | Kolmogorov–Smirnov | 0.42 | 0.37 | 0.20 | 0.57 | 0.57 | 0.23 | 0.00 |
| IR | NS | Kullback-Leibler | 7.44 | 4.77 | 7.88 | 15.43 | 13.12 | 13.30 | 0.00 |
| IR | NS | Squared-Chi | 0.57 | 0.46 | 0.36 | 0.88 | 0.80 | 0.47 | 0.00 |
| IR | SR | Hellinger | 0.89 | 0.84 | 0.32 | 1.13 | 1.09 | 0.49 | 0.00 |
| IR | SR | Jensen-Shannon | 0.18 | 0.14 | 0.12 | 0.28 | 0.24 | 0.20 | 0.00 |
| IR | SR | Kolmogorov–Smirnov | 0.42 | 0.37 | 0.20 | 0.57 | 0.56 | 0.27 | 0.00 |
| IR | SR | Kullback-Leibler | 7.44 | 4.77 | 7.88 | 15.90 | 10.08 | 16.29 | 0.00 |
| IR | SR | Squared-Chi | 0.57 | 0.46 | 0.36 | 0.88 | 0.78 | 0.58 | 0.00 |
| NS | SR | Hellinger | 1.20 | 1.19 | 0.39 | 1.13 | 1.09 | 0.49 | 0.08 |
| NS | SR | Jensen-Shannon | 0.29 | 0.26 | 0.16 | 0.28 | 0.24 | 0.20 | 0.65 |
| NS | SR | Kolmogorov–Smirnov | 0.57 | 0.57 | 0.23 | 0.57 | 0.56 | 0.27 | 0.99 |
| NS | SR | Kullback-Leibler | 15.43 | 13.12 | 13.30 | 15.90 | 10.08 | 16.29 | 0.75 |
| NS | SR | Squared-Chi | 0.88 | 0.80 | 0.47 | 0.88 | 0.78 | 0.58 | 0.93 |
Campione, E., & Véronis, J. (1998). A multilingual prosodic database. Proc. of the 5th International Conference on Spoken Language Processing (ICSLP’98), Sydney, Australia, 3163-3166.
Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1), 83-97.
Hall, P., & York, M. (2001). On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11, 515-536.
Hartigan, J. A., & Hartigan, P. M. (1985) The Dip Test of Unimodality. Annals of Statistics 13, 70–84.
Hartigan, P. M. (1985) Computation of the Dip Statistic to Test for Unimodality. Applied Statistics (JRSS C) 34, 320–325.
Le, V. B., Tran, D. D., Castelli, E., Besacier, L., & Serignat, J. F. (2004). Spoken and Written Language Resources for Vietnamese. In LREC. 4, pp. 599-602.
Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H., Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9). Association for Computational Linguistics, Gothenburg, Sweden, 36-43.
New B., Pallier C., Ferrand L., & Matos R. (2001). Une base de données lexicales du français contemporain sur internet: LEXIQUE 3.80, L’Année Psychologique, 101, 447-462. http://www.lexique.org.
Oh, Y. M. (2015). Linguistic complexity and information: quantitative approaches. PhD Thesis, Université de Lyon, France. Retrieved from http://www.afcp-parole.org/doc/theses/these_YMO15.pdf
Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610-615.
Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (Eds.) WaCky! Working papers on the web as corpus, Gedit, Bologna, http://corpus.leeds.ac.uk/queryzh.html.
Silverman, B.W. (1981). Using Kernel Density Estimates to investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43, 97-99.
Váradi, T. (2002). The Hungarian National Corpus. In LREC.
Zséder, A., Recski, G., Varga, D., & Kornai, A. (2012). Rapid creation of large-scale corpora and frequency dictionaries. In Proceedings to LREC 2012.
R session info

This document was compiled on:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: grid, compiler, parallel, splines, stats, graphics, grDevices, datasets, utils, methods and base
other attached packages: broman(v.0.68-2), philentropy(v.0.3.0), diptest(v.0.75-7), pander(v.0.6.3), moments(v.0.14), sjPlot(v.2.6.2), sjstats(v.0.17.3), gamlss.mx(v.4.3-5), nnet(v.7.3-12), gamlss(v.5.1-2), nlme(v.3.1-137), gamlss.dist(v.5.1-1), MASS(v.7.3-51.1), gamlss.data(v.5.1-0), lmerTest(v.3.1-0), lme4(v.1.1-20), Matrix(v.1.2-15), plyr(v.1.8.4), reshape2(v.1.4.3), ggplot2(v.3.1.0) and RhpcBLASctl(v.0.18-205)
loaded via a namespace (and not attached): numDeriv(v.2016.8-1), tools(v.3.4.4), TMB(v.1.7.15), backports(v.1.1.3), R6(v.2.4.0), sjlabelled(v.1.0.16), lazyeval(v.0.2.1), colorspace(v.1.4-0), withr(v.2.1.2), tidyselect(v.0.2.5), mnormt(v.1.5-5), emmeans(v.1.3.2), sandwich(v.2.5-0), labeling(v.0.3), scales(v.1.0.0), mvtnorm(v.1.0-8), psych(v.1.8.12), ggridges(v.0.5.1), stringr(v.1.4.0), digest(v.0.6.18), foreign(v.0.8-71), minqa(v.1.2.4), rmarkdown(v.1.11), stringdist(v.0.9.5.1), pkgconfig(v.2.0.2), htmltools(v.0.3.6), highr(v.0.7), pwr(v.1.2-2), rlang(v.0.3.1), generics(v.0.0.2), zoo(v.1.8-4), dplyr(v.0.8.0.1), magrittr(v.1.5), modeltools(v.0.2-22), bayesplot(v.1.6.0), Rcpp(v.1.0.0), munsell(v.0.5.0), prediction(v.0.3.6.2), stringi(v.1.3.1), multcomp(v.1.4-8), yaml(v.2.2.0), snakecase(v.0.9.2), sjmisc(v.2.7.7), forcats(v.0.4.0), crayon(v.1.3.4), lattice(v.0.20-38), ggeffects(v.0.8.0), haven(v.2.1.0), hms(v.0.4.2), knitr(v.1.21), pillar(v.1.3.1), estimability(v.1.3), codetools(v.0.2-16), stats4(v.3.4.4), glue(v.1.3.0), evaluate(v.0.13), data.table(v.1.12.0), modelr(v.0.1.4), nloptr(v.1.2.1), gtable(v.0.2.0), purrr(v.0.3.0), tidyr(v.0.8.2), assertthat(v.0.2.0), xfun(v.0.5), coin(v.1.2-2), xtable(v.1.8-3), broom(v.0.5.1), coda(v.0.19-2), survival(v.2.43-3), tibble(v.2.0.1), glmmTMB(v.0.2.3) and TH.data(v.1.0-10)
Here we generate the figures used in the main paper (saved to the ./figures folder as 600 DPI TIFF files Figure-*.tiff).